Agenda
Announcements
- MDSR Ch 9 programming notebook assigned
- MDSR Ch 8 & 9 Exercises assigned
- midterm exam
- mix of in-class and take-home
- in-class portion Wed 2/27
- take-home portion due Fri 3/1 at 11:59pm
- class will not meet on Friday 3/1
MDSR Ch 9 Errata / Tips
- Some sections don’t require programming, but please still include the headers for navigation purposes
- p. 206: the
destfile
argument probably won’t work for you as written. You can
- (recommended) create a folder called “data” in the same directory as your .Rmd file
- delete
data/
and use destfile = fueleconomy.zip
instead. You’ll need to update subsequent functions on p. 206 and 207 in kind when attempting to access files in a subdirectory called data
- Section 9.2.2: my results aren’t identical to those in the book… maybe I made a mistake or maybe the data has changed. Namely, the clusters are similar but don’t exactly match on close inspection.
Statistical Learning (recall)
Q: What are some differences between Supervised and Unsupervised Learning?
Statistical Learning (recall)
- statistical learning refers to a vast set of tools for understanding data
- Typical distinction between Supervised and Unsupervised learning
- Supervised learning predictive or inferential modeling
- there is a response variable (Y)
- predictive modeling: we want to anticipate the response (Y) ahead of time based on knowledge of a set of measurable inputs (X’s)
- inferential modeling: we want to understand the way our response (Y) changes as the explanatory variables (X’s) change
- Unsupervised learning methods could be considered “data discovery” models
- there is NO response variable (Y)
- we are interested in exposing interesting relationships/groups/clusters among several explanatory variables (X’s)
Unsupervised Learning
- Often considered within the domain of Exploratory Data Analysis
- Search for useful structure among explanatory variables: X1, X2, …,Xp.
- Informative way to visualize the data?
- Subgroups among the variables or among the observations?
- Dimension reduction
- express low-dimensional representation of the data that explains a good fraction of the variance
- commonly used for data visualization or pre-processing before supervised learning techniques are applied
- Principle component analysis (PCA)–discussed here
- Singular value decomposition (SVD)–see MDSR Ch 9, for example
- Clustering
- broad class of methods for discovering unknown subgroups within data
- Hierarchical clustering:
- k-means: specify number of groups
- Q: when would this be useful?
Challenge of Unsupervised Learning
- No “groud truth”
- in supervised learning we can check our work by evaluating predictions using cross-validation, independent test set, etc.
- typically no mechanism to “check our work” in unsupervised learning because the true answer is unknown
Some Examples
A cancer researcher might assay gene expression levels in 100 patients with breast cancer. He or she might then look for subgroups among the breast cancer samples, or among the genes, in order to obtain a better understanding of the disease.
A search engine (e.g., Google) might choose what search results to display to a particular individual based on the click histories of other individuals with similar search patterns.
Mapping evolutionary relationships among various biological species or other entities–their phylogeny–based upon similarities and differences in their physical or genetic characteristics.
Principal Components Analysis (PCA)
- Goal: summarize a large set of correlated variables with a smaller number of representative variables that collectively explain most of the variability in the original data set
- Intuition:
- many of our variables are correlated or possibly redundant
- perhaps we can rotate our data to identify
- (PC1) the axis that maximizes variability (first principal component–PC1)
- (PC2) an axis perpendicular to PC1 that maximizes the remaining variability
- and so on… until we explain all of variability in the data
- hopefully, we can explain most (or a lot) of the total variability with only the first few principal components
- sometimes we can even attempt to interpret how principal components associate (i.e., “load”) with respect to other variables in the data
- Method: PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition (SVD) of a data matrix, usually after a normalization step of the initial data.
US Arrests by State
- we want to compare states on these four variables
- Murder
- Assault
- Rape
- UrbanPop (% of state living in urban areas)
- Q: which variable(s) would you expect to have the largest variance
data("USArrests")
head(USArrests)
PCA: US Arrest Data
- measurement units & scales significantly impact variance
- we often want to standardize the variables first
- mean = 0; sd = 1 (convert measurements to z-scores)
- this amounts to giving each variable equal weight
- mutes effect of unit conversion (km to meters impacts variance estimate)
- might not standardize if all variables are measured on same scale
First two principal components
- Q: How do PC1 and PC2 associate with our four variables?
- Note: See what happens if we DON’T standardize variables…
USArrests_pca <- USArrests %>%
prcomp(scale = TRUE) # standardize the variables
# the result is a list object
str(USArrests_pca)
List of 5
$ sdev : num [1:4] 1.575 0.995 0.597 0.416
$ rotation: num [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
.. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
$ center : Named num [1:4] 7.79 170.76 65.54 21.23
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ scale : Named num [1:4] 4.36 83.34 14.47 9.37
..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
$ x : num [1:50, 1:4] -0.976 -1.931 -1.745 0.14 -2.499 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
.. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
- attr(*, "class")= chr "prcomp"
# first two principal components
(-1) * USArrests_pca$rotation[, 1:2] %>% round(2)
PC1 PC2
Murder 0.54 -0.42
Assault 0.58 -0.19
UrbanPop 0.28 0.87
Rape 0.54 0.17
# plot of the first two principal components
USArrests_pca$x %>%
as.data.frame() %>% # `ggplot2` expects a data frame object
rownames_to_column() %>%
ggplot(aes(x = -PC1, y = -PC2)) +
geom_text(aes(label = rowname), size = 3) +
xlab("Best Vector from PCA (approx. Violent Crime)") +
ylab("2nd Best Vector from PCA (approx. Urbanization)") +
ggtitle("Two-dimensional representation of US Arrests by State")

Proportion of variance explained
- (recall) Goal: summarize a large set of correlated variables with a smaller number of representative variables that collectively explain most of the variability in the original data set
- In general, a \(n \times p\) data matrix X has \(\text{min}(n-1, p)\) distince principal components
- we want smallest number of PC’s to get a good understanding of the data
- there isn’t really an optimal solution to this problem…
- Proportion of variability explained (PVE)
- Assess PVE for each of our principal component vectors
- Can we find a point of diminishing return among principal components
- we want fewest number of PC’s that still do a good job representing the data
- informally, we look for an “elbow” in the scree plot
Scree Plot (US Arrest Data)
- Q: How did we do?
- Q: How many principal components do you think we should consider?
USArrests_pve <-
data.frame(sd = USArrests_pca$sdev) %>%
rownames_to_column() %>%
mutate(rowname = parse_number(rowname),
totalVar = sum(USArrests_pca$sdev^2),
pve = 100 * sd^2 / totalVar,
cusum = cumsum(pve))
# scree plot
USArrests_pve %>%
ggplot(aes(x = rowname, y = pve)) +
geom_line(type = 3) +
xlab("Principal Component") +
ylab("Proportion of Variance Explained") +
ggtitle("Scree Plot of Principal Components for US Arrests Data")
# cumulative PVE plot
USArrests_pve %>%
ggplot(aes(x = rowname, y = cusum)) +
geom_line(type = 3) +
xlab("Principal Component") +
ylab("Proportion of Variance Explained") +
ggtitle("Cumulative Proportion of Variance Explained for US Arrests Data")
NCI60 Data Example
- The NCI-60 cancer cell line panel is a group of 60 human cancer cell lines used by the National Cancer Institute (NCI) for the screening of compounds to detect potential anticancer activity.
- Q: Why would we care about dimension reduction here? (what’s a “case” in the data?)
Inspecting the data
require(ISLR)
data("NCI60")
cancerLabels <- NCI60$labs
# recode immortalized cell lines
cancerLabels <- ifelse(test = grepl(pattern = "MCF7", x = cancerLabels),
yes = "BREAST", no = cancerLabels)
cancerLabels <- ifelse(test = grepl(pattern = "K562", x = cancerLabels),
yes = "LEUKEMIA", no = cancerLabels)
nciData <- NCI60$data
# dimensions of the data
nrow(nciData) # cases
[1] 64
ncol(nciData) # variables
[1] 6830
# inspect NCI cancer labels
head(cancerLabels)
[1] "CNS" "CNS" "CNS" "RENAL" "BREAST" "CNS"
# distribution of cancer cell lines available
tally(cancerLabels ~ 1)
1
cancerLabels 1
BREAST 9
CNS 5
COLON 7
LEUKEMIA 8
MELANOMA 8
NSCLC 9
OVARIAN 6
PROSTATE 2
RENAL 9
UNKNOWN 1
Principal components analysis of NCI60 data
- could make a case either way for standardizing variables since all metrics are on same scale in this case
- Q: How many total principal components possible?
- We have 64 PC’s this time
- US Arrests had 4 PC’s
# perform pca on scaled genes
NCI_pca <- nciData %>%
prcomp(scale = TRUE)
# the result is a list object
str(NCI_pca)
List of 5
$ sdev : num [1:64] 27.9 21.5 19.8 17 16 ...
$ rotation: num [1:6830, 1:64] -0.01068 -0.00231 -0.00588 0.00328 -0.00768 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6830] "1" "2" "3" "4" ...
.. ..$ : chr [1:64] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:6830] -0.0191 -0.0278 -0.0199 -0.3287 0.0261 ...
..- attr(*, "names")= chr [1:6830] "1" "2" "3" "4" ...
$ scale : Named num [1:6830] 0.441 0.757 0.433 1.092 0.485 ...
..- attr(*, "names")= chr [1:6830] "1" "2" "3" "4" ...
$ x : num [1:64, 1:64] -19.7 -22.9 -27.2 -42.5 -55 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:64] "V1" "V2" "V3" "V4" ...
.. ..$ : chr [1:64] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
First few principal components of NCI60 data
# plot PC1 vs PC2
NCI_pca$x %>%
as.data.frame() %>% # `ggplot2` expects a data frame object
ggplot(aes(x = PC1, y = PC2)) +
geom_point(aes(color = cancerLabels), size = 3) +
xlab("Best Vector from PCA") +
ylab("Second Best Vector from PCA") +
ggtitle("Two-dimensional representation of 6830 Genes (colored by actual cancer type)")

# plot of PC1 vs PC3
NCI_pca$x %>%
as.data.frame() %>% # `ggplot2` expects a data frame object
ggplot(aes(x = PC1, y = PC3)) +
geom_point(aes(color = cancerLabels), size = 3) +
xlab("Best Vector from PCA") +
ylab("Third Best Vector from PCA") +
ggtitle("Two-dimensional representation of 6830 Genes (colored by actual cancer type)")

# SD and variance explained by each PC
summary(NCI_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
Standard deviation 27.853 21.4814 19.8205 17.0326 15.9718 15.7211 14.4715 13.5443 13.1440 12.7386 12.6867
Proportion of Variance 0.114 0.0676 0.0575 0.0425 0.0374 0.0362 0.0307 0.0269 0.0253 0.0238 0.0236
Cumulative Proportion 0.114 0.1812 0.2387 0.2812 0.3185 0.3547 0.3853 0.4122 0.4375 0.4613 0.4848
PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22
Standard deviation 12.1577 11.8302 11.6255 11.4378 11.0005 10.6567 10.4888 10.4352 10.3219 10.1461 10.0544
Proportion of Variance 0.0216 0.0205 0.0198 0.0192 0.0177 0.0166 0.0161 0.0159 0.0156 0.0151 0.0148
Cumulative Proportion 0.5065 0.5270 0.5467 0.5659 0.5836 0.6002 0.6163 0.6323 0.6479 0.6630 0.6778
PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34
Standard deviation 9.9027 9.6477 9.5076 9.3325 9.2732 9.0900 8.9812 8.7500 8.5996 8.4474 8.3730 8.21579
Proportion of Variance 0.0144 0.0136 0.0132 0.0127 0.0126 0.0121 0.0118 0.0112 0.0108 0.0104 0.0103 0.00988
Cumulative Proportion 0.6921 0.7057 0.7190 0.7317 0.7443 0.7564 0.7682 0.7794 0.7903 0.8007 0.8110 0.82087
PC35 PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45
Standard deviation 8.15731 7.97465 7.90446 7.82127 7.72156 7.58603 7.45619 7.3444 7.10449 7.0131 6.95839
Proportion of Variance 0.00974 0.00931 0.00915 0.00896 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072 0.00709
Cumulative Proportion 0.83061 0.83992 0.84907 0.85803 0.86676 0.87518 0.88332 0.8912 0.89861 0.9058 0.91290
PC46 PC47 PC48 PC49 PC50 PC51 PC52 PC53 PC54 PC55 PC56
Standard deviation 6.8663 6.80744 6.64763 6.61607 6.40793 6.21984 6.20326 6.06706 5.91805 5.91233 5.73539
Proportion of Variance 0.0069 0.00678 0.00647 0.00641 0.00601 0.00566 0.00563 0.00539 0.00513 0.00512 0.00482
Cumulative Proportion 0.9198 0.92659 0.93306 0.93947 0.94548 0.95114 0.95678 0.96216 0.96729 0.97241 0.97723
PC57 PC58 PC59 PC60 PC61 PC62 PC63 PC64
Standard deviation 5.47261 5.2921 5.02117 4.68398 4.17567 4.08212 4.04124 2.15e-14
Proportion of Variance 0.00438 0.0041 0.00369 0.00321 0.00255 0.00244 0.00239 0.00e+00
Cumulative Proportion 0.98161 0.9857 0.98940 0.99262 0.99517 0.99761 1.00000 1.00e+00
# proportion of variance explained (PVE) of each PC
NCI_pve <-
data.frame(sd = NCI_pca$sdev) %>%
rownames_to_column() %>%
mutate(rowname = parse_number(rowname),
totalVar = sum(NCI_pca$sdev^2),
pve = 100 * sd^2 / totalVar,
cusum = cumsum(pve))
# scree plot
NCI_pve %>%
ggplot(aes(x = rowname, y = pve)) +
geom_line(type = 3) +
xlab("Principal Component") +
ylab("Proportion of Variance Explained") +
ggtitle("Scree Plot of Principal Components for NCI60 Data")
Ignoring unknown parameters: type

# cumulative PVE plot
NCI_pve %>%
ggplot(aes(x = rowname, y = cusum)) +
geom_line(type = 3) +
xlab("Principal Component") +
ylab("Proportion of Variance Explained") +
ggtitle("Cumulative Proportion of Variance Explained for NCI60 Data")
Ignoring unknown parameters: type

Clustering
- Both principal components and clustering seek to simplify the data via a small number of summaries, but the mechanisms differ
- PCA (& SVD) attempts to find a low-dimensional representation of the observations that explain a good fraction of the variance
- Clustering looks to find subgroups or define similarity among the observations
- Clustering algorithms impart organizing structure for describing degrees of similarity between different things
- Hierarchical clustering does not specify a desired number of clusters, instead maps similarity among all n (distinct) observations
- a dendogram: a tree-based organizing structure for describing/visualizing those degrees of similarity,
- doesn’t matter how those relationships came to be
- The tree may or may not reflect some deeper relationship among the objects measured
- Looks like a decision tree, but it approaches the problem from the opposite direction
- ** K-means** clustering seeks to partition observations into a pre-specified number of (K) clusters
Hierarchical clustering
- appropriate when cases are decribed by a set of numerical variables (none of which is a response)
- Method (agglomorative/bottom-up clustering):
- Begin with n cases each measured as a point in Cartesian space and calculate all \({n\choose2} = \frac{n(n-1)}{2}\) pairwise dissimilarities. Treat each observation as its own cluster.
- For \(i = n, (n-1), ..., 2\): A. Examine all pairwise inter-cluster dissimilarities among the i clusters and fuse the two clusters that are least dissimilar.
B. Compute new pairwise inter-cluster dissimilarities among the \(i-1\) remaining clusters
- for an individual quantitatve variable, it’s easy to measure distance between cases
- with multiple variables, we need to adjust for different scales and units
- no “best” solution… usually requires domain expertise
- with no other information, Euclidean distance is a sensible default (but there are others)
Distance & Linkage:
- multiple linkages are available for calculating inter-cluster dissimilarities.
- most common linkages (balanced dendograms)
- complete linkage: largest inter-cluster dissimilarity
- average linkage: mean inter-cluster dissimilarity
- less common linkages
- single linkage: smallest inter-cluster dissimilarity
- centroid linkage: dissimilarity among the centroids of clusters
Hierarchical clustering
- The tree is “grown” in reverse–from bottom to top
- The dissimilarity between clusters indicates the height in the dendogram at which the fusion should be placed.
- Q: which is more similar to obs #2?
- Q: which pair of observations is more similar?
- Where might we “cut” the dendogram to define clusters?
Cluster analysis of NCI60 data
- the rescaled variables inherited unfortunate names (just a column number)
- Q: What means & sd’s do you expect for variables
1
and 2
? (does it match?)
- clustering algorithm
- calculate all point-to-point distances (e.g., pairwise dissimilarities)
- begin by treating each point as a cluster
- iteratively fuse clusters that are least dissimilar according to linkage chosen
Hierarchical Clustering
# scale the data (centered with SD 1)
NCI_std <-
scale(nciData) %>%
as.data.frame()
# the variables have inherited some unfortunate names (just column number)
favstats(~ `1`, data = NCI_std)
favstats(~ `2`, data = NCI_std)
NCI_dist <- dist(NCI_std)
# # plot dedrogram (average linkage)
# NCI_dist %>%
# hclust(method = "average") %>%
# plot(cex = 0.9, labels = cancerLabels, main = "NCI60 Dendogram with Average Linkage")
# construct dendogram (complete linkage)
NCI_dendo <-
NCI_dist %>%
hclust(method = "complete")
# print dendogram info (distance & method)
print(NCI_dendo)
Call:
hclust(d = ., method = "complete")
Cluster method : complete
Distance : euclidean
Number of objects: 64
# plot dedrogram (complete linkage)
NCI_dendo %>%
plot(cex = 0.9, labels = cancerLabels, lwd = 2,
main = "NCI60 Dendogram with Complete Linkage")

Cut Dendogram to define clusters
- Complete dendogram can be used to produce clusters
- establish cut point based on dissimilarity index (vertical axis)
- software can choose cut based on requested number of clusters
- Q: Did we learn anything from our clusters?
# Cut dendogram to produce clusters
NCI_dendo %>%
plot(labels = cancerLabels, lwd = 2,
main = "NCI60 Dendogram with Complete Linkage (5 clusters)") %>%
abline(h = 135, col = "red", lwd = 3)

# Cut dendogram--Hierarchical clusters
NCI_DendoClusters <- cutree(tree = NCI_dendo, k = 5)
# clustering patterns (Leukemia & melanoma; not so much breast)
tally(cancerLabels ~ NCI_DendoClusters)
NCI_DendoClusters
cancerLabels 1 2 3 4 5
BREAST 0 3 0 4 2
CNS 3 2 0 0 0
COLON 2 0 0 5 0
LEUKEMIA 0 0 8 0 0
MELANOMA 2 0 0 0 6
NSCLC 7 1 0 0 1
OVARIAN 6 0 0 0 0
PROSTATE 2 0 0 0 0
RENAL 8 1 0 0 0
UNKNOWN 1 0 0 0 0
K-means clustering
- Goal: partition the observations into a pre-specified number of (K) non-overlapping clusters
- minimize within-cluster variation
- each observation is assigned to exactly one cluster
- similar to classification, but there’s no response variable, so meaning of clusters is inferred implicitly
- Method (see figure):
- Randomly assign each of the observations to clusters 1 through K
- Iterate until cluster assignments stop changing: A. For each of the K clusters, compute the cluster centroid (vector of p feature means for the observations in the kth cluster) B. Assign each observation to the cluster whose centroid is closest (e.g., in Euclidean distance)
- (strongly recommended) Run algorithm multiple times from different random initial configurations to temper impact of randomness in step 1. Argument
nstart
is available in kmeans()
function for this purpose.
- Cluster interpretation:
- remember this is part of EDA to understand structure in our data
- plot the clusters
- investigate summary statistics for the clusters
K-means Clustering
- Suppose we consider k-means with 5 clusters
- Q: How does result compare with our Hier. Clust. Dendogram?
set.seed(2)
# perform kmeans clustering (k = 5 clusters)
NCI_kmean <-
NCI_std %>%
kmeans(centers = 5, nstart = 20)
# what are we working with
str(NCI_kmean)
List of 9
$ cluster : Named int [1:64] 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr [1:64] "V1" "V2" "V3" "V4" ...
$ centers : num [1:5, 1:6830] 0.0205 0.2232 0.5643 -0.2946 -0.4695 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..$ : chr [1:6830] "1" "2" "3" "4" ...
$ totss : num 430290
$ withinss : num [1:5] 37150 154545 10326 82235 44071
$ tot.withinss: num 328326
$ betweenss : num 101964
$ size : int [1:5] 9 27 4 16 8
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
# compare Hierarchical Clusters with K-Means Clusters
NCI_KMeanClusters <- NCI_kmean$cluster
# both methods match for one cluster, but others are noisier
tally(NCI_DendoClusters ~ NCI_KMeanClusters)
NCI_KMeanClusters
NCI_DendoClusters 1 2 3 4 5
1 1 20 0 10 0
2 0 7 0 0 0
3 0 0 0 0 8
4 0 0 4 5 0
5 8 0 0 1 0
Hierarchical Clustering on first 7 principal components
- How might we combine methods?
- PCA for dimension reduction
- Cluster to assess similarity
NCI_pca_hcluster <-
NCI_pca$x[, 1:7] %>%
dist() %>%
hclust()
# plot
NCI_pca_hcluster %>%
plot(labels = cancerLabels, lwd = 2,
main = "Hierarcical Clustering on First Seven Principal Components")

tally(cancerLabels ~ cutree(NCI_pca_hcluster, k = 7))
cutree(NCI_pca_hcluster, k = 7)
cancerLabels 1 2 3 4 5 6 7
BREAST 2 1 0 0 0 4 2
CNS 4 1 0 0 0 0 0
COLON 0 0 7 0 0 0 0
LEUKEMIA 0 0 0 6 2 0 0
MELANOMA 1 0 0 0 0 0 7
NSCLC 5 1 3 0 0 0 0
OVARIAN 5 1 0 0 0 0 0
PROSTATE 2 0 0 0 0 0 0
RENAL 8 1 0 0 0 0 0
UNKNOWN 0 1 0 0 0 0 0
Practical issues in clustering
- Decisions to be made…
- Standardize variables?
- Hierarchical Clustering decisions
- Which dissimilarity measure?
- What type of linkage?
- Where might we “cut” the dendogram to define clusters?
- K-means decision
- Integrity of the clusters obtained
- hard to validate clusters
- no consensus on assessing whether cluster is artifact of chance (e.g. p-value)
- sensitive to extreme observations (and multivariate outliers aren’t always easy to spot)
Recommendations
- experiment with different choices of linkage, standardized/not, etc, and look for patterns or structures that consistently emerge
- cluster random subsets of the data to get sense of robustness to outliers
- most importantly, be careful when reporting results of cluster analysis
- not absolute truth about the data (much less the population)
- it’s more of a starting point to generate scientific questions for study on (ideally) independent data
---
title: "Unsupervised Learning"
subtitle: "MDSR Ch 9 & ISLR Ch 10"
output: 
  slidy_presentation: default
  html_notebook: default  
---



```{r Front Matter, echo=TRUE, message=FALSE, warning=FALSE, include=FALSE}
# clean up R environment
rm(list = ls())

# global options
knitr::opts_chunk$set(eval=TRUE, include=TRUE)
options(digits=4)

# packages used
library(mdsr)
library(tidyverse)
library(ISLR)
library(tibble)
library(ggdendro)


# user-defined functions 


# inputs summary
data("USArrests")
data("NCI60")

```


## Agenda


#### Announcements

- MDSR Ch 9 programming notebook assigned
- MDSR Ch 8 & 9 Exercises assigned
- midterm exam
    - mix of in-class and take-home
    - in-class portion Wed 2/27
    - take-home portion due Fri 3/1 at 11:59pm
    - class will not meet on Friday 3/1 

#### MDSR Ch 9 Errata / Tips

- Some sections don't require programming, but please still include the headers for navigation purposes
- p. 206: the `destfile` argument probably won't work for you as written.  You can
    1. (recommended) create a folder called "data" in the same directory as your .Rmd file
    2. delete `data/` and use `destfile = fueleconomy.zip` instead.  You'll need to update subsequent functions on p. 206 and 207 in kind when attempting to access files in a subdirectory called `data`
- Section 9.2.2: my results aren't identical to those in the book... maybe I made a mistake or maybe the data has changed.  Namely, the clusters are similar but don't *exactly* match on close inspection.  


## Statistical Learning (recall)

Q: What are some differences between **Supervised** and **Unsupervised** Learning?



## Statistical Learning (recall)

- statistical learning refers to a vast set of tools for understanding data
- Typical distinction between **Supervised** and **Unsupervised** learning
    - **Supervised** learning predictive or inferential modeling
        - there is a response variable (Y)
        - predictive modeling: we want to anticipate the response (Y) ahead of time based on knowledge of a set of measurable inputs (X's)
        - inferential modeling: we want to understand the way our response (Y) changes as the explanatory variables (X's) change
    - **Unsupervised** learning methods could be considered "data discovery" models 
        - there is NO response variable (Y)
        - we are interested in exposing interesting relationships/groups/clusters among several explanatory variables (X's)


## Unsupervised Learning 

- Often considered within the domain of Exploratory Data Analysis
    - Search for useful structure among explanatory variables: X1, X2, ...,Xp. 
    - Informative way to visualize the data? 
    - Subgroups among the variables or among the observations?  
- **Dimension reduction**
    - express low-dimensional representation of the data that explains a good fraction of the variance
    - commonly used for data visualization or pre-processing before supervised learning techniques are applied
    - Principle component analysis (PCA)--discussed here
    - Singular value decomposition (SVD)--see MDSR Ch 9, for example
- **Clustering**
    - broad class of methods for **discovering unknown subgroups** within data
    - Hierarchical clustering: 
    - k-means: specify number of groups
- Q: when would this be useful?


## Challenge of Unsupervised Learning

- No "groud truth"
- in supervised learning we can check our work by evaluating predictions using cross-validation, independent test set, etc.
- typically no mechanism to "check our work" in unsupervised learning because the true answer is unknown


## Some Examples 

- A cancer researcher might assay gene expression levels in 100 patients with breast cancer. He or she might then look for subgroups among the breast cancer samples, or among the genes, in order to obtain a better understanding of the disease. 

- A search engine (e.g., Google) might choose what search results to display to a particular individual based on the click histories of other individuals with similar search patterns. 

- Mapping evolutionary relationships among various biological species or other entities--their phylogeny--based upon similarities and differences in their physical or genetic characteristics.


## Principal Components Analysis (PCA)

- **Goal**: summarize a large set of correlated variables with a smaller number of representative variables that collectively explain most of the variability in the original data set
- **Intuition**:
    - many of our variables are correlated or possibly redundant
    - perhaps we can rotate our data to identify 
        1. (PC1) the axis that maximizes variability (first principal component--PC1)
        2. (PC2) an axis perpendicular to PC1 that maximizes the remaining variability
        3. and so on... until we explain all of variability in the data
    - hopefully, we can explain most (or a lot) of the total variability with only the **first few** principal components
    - sometimes we can even attempt to interpret how principal components associate (i.e., "load") with respect to other variables in the data 
- **Method**: PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition (SVD) of a data matrix, usually after a normalization step of the initial data. 

![image credit: James et al (2013) <http://www-bcf.usc.edu/~gareth/ISL/> Fig 6.14](pca-ISLR-6-14.png)

## US Arrests by State

- we want to compare states on these four variables 
    - Murder
    - Assault
    - Rape
    - UrbanPop (% of state living in urban areas)
- Q: which variable(s) would you expect to have the largest **variance**

```{r}
data("USArrests")
head(USArrests)
```

## PCA: US Arrest Data

- measurement units & scales significantly impact variance
- we often want to standardize the variables first 
    - mean = 0; sd = 1 (convert measurements to z-scores)
    - this amounts to giving each variable equal weight
    - mutes effect of unit conversion (km to meters impacts variance estimate)
- might not standardize if all variables are measured on same scale


#### First two principal components

- Q: How do PC1 and PC2 associate with our four variables? 
- Note: See what happens if we DON'T standardize variables...

```{r}
USArrests_pca <- USArrests %>%
  prcomp(scale = TRUE)  # standardize the variables

# the result is a list object
str(USArrests_pca)

# first two principal components
(-1) * USArrests_pca$rotation[, 1:2] %>% round(2)
```

```{r}
# plot of the first two principal components
USArrests_pca$x %>%
  as.data.frame() %>%  # `ggplot2` expects a data frame object
  rownames_to_column() %>%
  ggplot(aes(x = -PC1, y = -PC2)) + 
  geom_text(aes(label = rowname), size = 3) + 
  xlab("Best Vector from PCA (approx. Violent Crime)") + 
  ylab("2nd Best Vector from PCA (approx. Urbanization)") + 
  ggtitle("Two-dimensional representation of US Arrests by State")

```


## Proportion of variance explained 

- (recall) **Goal**: summarize a large set of correlated variables with a smaller number of representative variables that collectively explain most of the variability in the original data set
- In general, a $n \times p$ data matrix **X** has $\text{min}(n-1, p)$ distince principal components
    - we want smallest number of PC's to get a good understanding of the data
    - there isn't really an optimal solution to this problem...
- **Proportion of variability explained (PVE)**
    - Assess PVE for each of our principal component vectors
    - Can we find a point of diminishing return among principal components
        - we want fewest number of PC's that still do a good job representing the data
        - informally, we look for an "elbow" in the **scree plot**

![By Kevin Lenz (talk contribs) - Own work, CC BY-SA 2.5, <https://commons.wikimedia.org/w/index.php?curid=1239742>](scree.jpg)

## Scree Plot (US Arrest Data)

- Q: How did we do?
- Q: How many principal components do you think we should consider?


```{r eval=FALSE}
USArrests_pve <- 
  data.frame(sd = USArrests_pca$sdev) %>%
  rownames_to_column() %>%
  mutate(rowname = parse_number(rowname), 
         totalVar = sum(USArrests_pca$sdev^2), 
         pve = 100 * sd^2 / totalVar, 
         cusum = cumsum(pve))

# scree plot
USArrests_pve %>%
  ggplot(aes(x = rowname, y = pve)) + 
  geom_line(type = 3) + 
  xlab("Principal Component") + 
  ylab("Proportion of Variance Explained") + 
  ggtitle("Scree Plot of Principal Components for US Arrests Data") 

  
# cumulative PVE plot
USArrests_pve %>%
  ggplot(aes(x = rowname, y = cusum)) + 
  geom_line(type = 3) + 
  xlab("Principal Component") + 
  ylab("Proportion of Variance Explained") + 
  ggtitle("Cumulative Proportion of Variance Explained for US Arrests Data") 

```



## NCI60 Data Example

- The NCI-60 cancer cell line panel is a group of 60 human cancer cell lines used by the National Cancer Institute (NCI) for the screening of compounds to detect potential anticancer activity.
    - <https://en.wikipedia.org/wiki/NCI-60>
    - 64 cancer cell lines
    - 6830 gene expression measurements
- Q: Why would we care about dimension reduction here? (what's a "case" in the data?)



## Inspecting the data

```{r}
require(ISLR)
data("NCI60")

cancerLabels <- NCI60$labs 

# recode immortalized cell lines
cancerLabels <- ifelse(test = grepl(pattern = "MCF7", x = cancerLabels),
                       yes = "BREAST", no = cancerLabels)
cancerLabels <- ifelse(test = grepl(pattern = "K562", x = cancerLabels),
                       yes = "LEUKEMIA", no = cancerLabels)

nciData <- NCI60$data 

# dimensions of the data
nrow(nciData)  # cases
ncol(nciData)  # variables

# inspect NCI cancer labels
head(cancerLabels)

# distribution of cancer cell lines available
tally(cancerLabels ~ 1)

```

<!-- Day2 -->

## Principal components analysis of NCI60 data

- could make a case either way for standardizing variables since all metrics are on same scale in this case
- Q: How many total principal components possible?
    - We have 64 PC's this time
    - US Arrests had 4 PC's

```{r}
# perform pca on scaled genes
NCI_pca <- nciData %>%
  prcomp(scale = TRUE)  

# the result is a list object
str(NCI_pca)
```


## First few principal components of NCI60 data

```{r}
# plot PC1 vs PC2
NCI_pca$x %>%
  as.data.frame() %>%  # `ggplot2` expects a data frame object
  ggplot(aes(x = PC1, y = PC2)) + 
  geom_point(aes(color = cancerLabels), size = 3) + 
  xlab("Best Vector from PCA") + 
  ylab("Second Best Vector from PCA") + 
  ggtitle("Two-dimensional representation of 6830 Genes (colored by actual cancer type)") 


# plot of PC1 vs PC3
NCI_pca$x %>%
  as.data.frame() %>%  # `ggplot2` expects a data frame object
  ggplot(aes(x = PC1, y = PC3)) + 
  geom_point(aes(color = cancerLabels), size = 3) + 
  xlab("Best Vector from PCA") + 
  ylab("Third Best Vector from PCA") +
  ggtitle("Two-dimensional representation of 6830 Genes (colored by actual cancer type)")
```



```{r}
# SD and variance explained by each PC
summary(NCI_pca)

# proportion of variance explained (PVE) of each PC
NCI_pve <- 
  data.frame(sd = NCI_pca$sdev) %>%
  rownames_to_column() %>%
  mutate(rowname = parse_number(rowname), 
         totalVar = sum(NCI_pca$sdev^2), 
         pve = 100 * sd^2 / totalVar, 
         cusum = cumsum(pve))

# scree plot
NCI_pve %>%
  ggplot(aes(x = rowname, y = pve)) + 
  geom_line(type = 3) + 
  xlab("Principal Component") + 
  ylab("Proportion of Variance Explained") + 
  ggtitle("Scree Plot of Principal Components for NCI60 Data") 

  
# cumulative PVE plot
NCI_pve %>%
  ggplot(aes(x = rowname, y = cusum)) + 
  geom_line(type = 3) + 
  xlab("Principal Component") + 
  ylab("Proportion of Variance Explained") + 
  ggtitle("Cumulative Proportion of Variance Explained for NCI60 Data") 

```






## Clustering

- Both principal components and clustering seek to simplify the data via a small number of summaries, but the mechanisms differ
    - PCA (& SVD) attempts to find a low-dimensional representation of the observations that explain a good fraction of the variance
    - Clustering looks to find subgroups or define similarity among the observations
- **Clustering** algorithms impart organizing structure for describing degrees of similarity between different things
    - **Hierarchical clustering** does not specify a desired number of clusters, instead maps similarity among all *n* (distinct) observations
    - a **dendogram**: a tree-based organizing structure for describing/visualizing those degrees of similarity,
      - doesn't matter how those relationships came to be
      - The tree may or may not reflect some deeper relationship among the objects measured
      - *Looks* like a decision tree, but it approaches the problem from the opposite direction
    - ** *K*-means** clustering seeks to partition observations into a pre-specified number of (K) clusters


![image credit: James et al (2013) <http://www-bcf.usc.edu/~gareth/ISL/> Fig 10.10](dendogram-ISLR-10-10.png)

## Hierarchical clustering

- appropriate when cases are decribed by a set of numerical variables (none of which is a *response*)
- Method (agglomorative/bottom-up clustering):
    1. Begin with *n* cases each measured as a point in Cartesian space and calculate all ${n\choose2} = \frac{n(n-1)}{2}$ pairwise dissimilarities.  **Treat each observation as its own cluster.**
    2. For $i = n, (n-1), ..., 2$:
        A. Examine all pairwise inter-cluster dissimilarities among the *i* clusters and fuse the two clusters that are **least dissimilar**.  
        B. Compute new pairwise inter-cluster dissimilarities among the $i-1$ remaining clusters
- for an individual quantitatve variable, it's easy to measure distance between cases
- with multiple variables, we need to adjust for different scales and units
    - no "best" solution... usually requires domain expertise
    - with no other information, Euclidean distance is a sensible default (but there are others)

![image credit: James et al (2013) <http://www-bcf.usc.edu/~gareth/ISL/> Fig 10.11](hclust-ISLR-10-11.png)


## Distance & Linkage: 

![image credit: James et al (2013) <http://www-bcf.usc.edu/~gareth/ISL/> Fig 10.11](hclust-ISLR-10-11.png)

- multiple linkages are available for calculating inter-cluster dissimilarities. 
- most common linkages (balanced dendograms)
    - complete linkage: largest inter-cluster dissimilarity
    - average linkage: mean inter-cluster dissimilarity
- less common linkages
    - single linkage: smallest inter-cluster dissimilarity 
    - centroid linkage: dissimilarity among the centroids of clusters 


## Hierarchical clustering

- The tree is "grown" in reverse--from bottom to top
- The **dissimilarity** between clusters indicates the height in the dendogram at which the fusion should be placed.
    - Q: which is more similar to obs #2?
        - Obs #7
        - Obs #9
    - Q: which pair of observations is more similar?
        - {3, 6}
        - {9, 2}
- Where might we "cut" the dendogram to define clusters?


![image credit: James et al (2013) <http://www-bcf.usc.edu/~gareth/ISL/> Fig 10.10](dendogram-ISLR-10-10.png) 

<!-- Day3 -->

## Cluster analysis of NCI60 data

- the rescaled variables inherited unfortunate names (just a column number)
    - Q: What means & sd's do you expect for variables `1` and `2`? (does it match?)
- clustering algorithm
    - calculate all point-to-point distances (e.g., pairwise dissimilarities)
    - begin by treating each point as a cluster
    - iteratively fuse clusters that are least dissimilar according to linkage chosen


##### Hierarchical Clustering

```{r fig.height=6, fig.width=8}
# scale the data (centered with SD 1)
NCI_std <-
  scale(nciData) %>%
  as.data.frame()

# the variables have inherited some unfortunate names (just column number)
favstats(~ `1`, data = NCI_std)
favstats(~ `2`, data = NCI_std)

NCI_dist <- dist(NCI_std)

# # plot dedrogram (average linkage)
# NCI_dist %>%
#   hclust(method = "average") %>%
#   plot(cex = 0.9, labels = cancerLabels, main = "NCI60 Dendogram with Average Linkage")

# construct dendogram (complete linkage)
NCI_dendo <-
  NCI_dist %>%
  hclust(method = "complete")

# print dendogram info (distance & method)
print(NCI_dendo)

# plot dedrogram (complete linkage)
NCI_dendo %>%
  plot(cex = 0.9, labels = cancerLabels, lwd = 2,
       main = "NCI60 Dendogram with Complete Linkage")

```



##### Cut Dendogram to define clusters

- Complete dendogram can be used to produce clusters
- establish cut point based on dissimilarity index (vertical axis)
- software can choose cut based on requested number of clusters
- Q: Did we learn anything from our clusters?

```{r fig.height=6, fig.width=8}
# Cut dendogram to produce clusters
NCI_dendo %>%
  plot(labels = cancerLabels, lwd = 2,
       main = "NCI60 Dendogram with Complete Linkage (5 clusters)") %>%
  abline(h = 135, col = "red", lwd = 3)

# Cut dendogram--Hierarchical clusters
NCI_DendoClusters <- cutree(tree = NCI_dendo, k = 5)


# clustering patterns (Leukemia & melanoma; not so much breast)
tally(cancerLabels ~ NCI_DendoClusters)


```



## K-means clustering

- Goal: partition the observations into a pre-specified number of (*K*) non-overlapping clusters
    - minimize within-cluster variation
    - each observation is assigned to exactly one cluster
    - similar to classification, but there's no response variable, so meaning of clusters is inferred implicitly
- Method (see figure):
    1. Randomly assign each of the observations to clusters 1 through K
    2. Iterate until cluster assignments stop changing:
        A. For each of the *K* clusters, compute the cluster *centroid* (vector of *p* feature means for the observations in the *k*th cluster)
        B. Assign each observation to the cluster whose centroid is closest (e.g., in Euclidean distance)
    3. (strongly recommended) Run algorithm multiple times from different random initial configurations to temper impact of randomness in step 1.  Argument `nstart` is available in `kmeans()` function for this purpose.
- Cluster interpretation:
    - remember this is part of EDA to understand structure in our data
    - plot the clusters
    - investigate summary statistics for the clusters

![image credit: James et al (2013) <http://www-bcf.usc.edu/~gareth/ISL/> Fig 10.6](kmeans-ISLR-10-6.png)


## K-means Clustering

- Suppose we consider k-means with 5 clusters
- Q: How does result compare with our Hier. Clust. Dendogram?

```{r}
set.seed(2)

# perform kmeans clustering (k = 5 clusters)
NCI_kmean <-
  NCI_std %>%
  kmeans(centers = 5, nstart = 20)

# what are we working with
str(NCI_kmean)

# compare Hierarchical Clusters with K-Means Clusters
NCI_KMeanClusters <- NCI_kmean$cluster

# both methods match for one cluster, but others are noisier
tally(NCI_DendoClusters ~ NCI_KMeanClusters)

```


## Hierarchical Clustering on first 7 principal components

- How might we combine methods?
    - PCA for dimension reduction
    - Cluster to assess similarity

```{r fig.height=4, fig.width=6}
NCI_pca_hcluster <-
  NCI_pca$x[, 1:7] %>%
  dist() %>%
  hclust()

# plot
NCI_pca_hcluster %>%
  plot(labels = cancerLabels, lwd = 2,
       main = "Hierarcical Clustering on First Seven Principal Components")

tally(cancerLabels ~ cutree(NCI_pca_hcluster, k = 7))
```



## Practical issues in clustering

- Decisions to be made...
    - Standardize variables?
    - Hierarchical Clustering decisions
        - Which dissimilarity measure?
        - What type of linkage?
        - Where might we "cut" the dendogram to define clusters?
    - K-means decision
        - how to choose K?
- Integrity of the clusters obtained
    - hard to validate clusters
    - no consensus on assessing whether cluster is artifact of chance (e.g. p-value)
    - sensitive to extreme observations (and multivariate outliers aren't always easy to spot)

#### Recommendations

- experiment with different choices of linkage, standardized/not, etc, and look for patterns or structures that consistently emerge
- cluster random subsets of the data to get sense of robustness to outliers
- most importantly, be careful when reporting results of cluster analysis
    - not absolute truth about the data (much less the population)
    - it's more of a starting point to generate scientific questions for study on (ideally) independent data



